Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As is typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
# To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
# To derive a lower bound for the marginal likelihood (sometimes called the "evidence") of the observed data (i.e. the marginal probability of the data given the model, with marginalization performed over the unobserved variables). This is typically used for model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model and hence a greater probability that the model in question was the one that generated the data. (See also the Bayes factor article.)

For the first purpose (approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods, particularly Markov chain Monte Carlo methods such as Gibbs sampling, for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate directly or to sample from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally optimal, exact analytical solution to an approximation of the posterior.
Variational Bayes can be seen as an extension of the EM (expectation-maximization) algorithm from maximum a posteriori estimation (MAP estimation) of the single most probable value of each parameter to fully Bayesian estimation, which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.

For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to iteratively update the parameters often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.

==Mathematical derivation of the mean-field approximation==
In variational inference, the posterior distribution over a set of unobserved variables <math>\mathbf{Z} = \{Z_1, \dots, Z_n\}</math> given some data <math>\mathbf{X}</math> is approximated by a variational distribution, <math>Q(\mathbf{Z})</math>:
:<math>P(\mathbf{Z} \mid \mathbf{X}) \approx Q(\mathbf{Z}).</math>

The distribution <math>Q(\mathbf{Z})</math> is restricted to belong to a family of distributions of simpler form than <math>P(\mathbf{Z} \mid \mathbf{X})</math>, selected with the intention of making <math>Q(\mathbf{Z})</math> similar to the true posterior, <math>P(\mathbf{Z} \mid \mathbf{X})</math>. The lack of similarity is measured in terms of a dissimilarity function <math>d(Q; P)</math>, and hence inference is performed by selecting the distribution <math>Q(\mathbf{Z})</math> that minimizes <math>d(Q; P)</math>.

The most common type of variational Bayes, known as ''mean-field variational Bayes'', uses the Kullback–Leibler divergence (KL-divergence) of ''P'' from ''Q'' as the choice of dissimilarity function. This choice makes the minimization tractable. The KL-divergence is defined as
:<math>D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z} \mid \mathbf{X})}.</math>

Note that ''Q'' and ''P'' are reversed from what one might expect.
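The asymmetry of the KL-divergence can be checked numerically on a small discrete example. The following is a minimal sketch, not part of the original article: the two three-state distributions are hypothetical values chosen for illustration, and the helper name `kl_divergence` is an assumption of this sketch.

```python
import math


def kl_divergence(q, p):
    """Return sum_z q(z) * log(q(z) / p(z)) for two discrete distributions.

    Terms with q(z) == 0 contribute zero by the usual convention.
    Assumes p(z) > 0 wherever q(z) > 0.
    """
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0.0)


# Hypothetical "true posterior" P(Z | X) over three states of Z (illustrative values).
p_posterior = [0.70, 0.20, 0.10]
# Hypothetical candidate approximation Q(Z) (illustrative values).
q_approx = [0.50, 0.30, 0.20]

# Direction minimized by mean-field variational Bayes: expectation taken under Q.
reverse_kl = kl_divergence(q_approx, p_posterior)
# Opposite direction: expectation taken under P.
forward_kl = kl_divergence(p_posterior, q_approx)

print(f"D_KL(Q || P) = {reverse_kl:.4f}")
print(f"D_KL(P || Q) = {forward_kl:.4f}")
```

Both quantities are non-negative and vanish only when the two distributions coincide, but they generally differ from each other, which is why the order of the arguments matters.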
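The use of the reversed KL-divergence in variational Bayes is conceptually similar to the expectation-maximization algorithm. (Using the KL-divergence the other way round produces the expectation propagation algorithm.)

Writing <math>\mathbf{Z}</math> for the unobserved variables and <math>\mathbf{X}</math> for the data, substituting <math>P(\mathbf{Z} \mid \mathbf{X}) = P(\mathbf{Z}, \mathbf{X}) / P(\mathbf{X})</math> and using <math>\sum_{\mathbf{Z}} Q(\mathbf{Z}) = 1</math>, the KL-divergence can be written as
:<math>D_{\mathrm{KL}}(Q \parallel P) = \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} + \log P(\mathbf{X}),</math>
or
:<math>\log P(\mathbf{X}) = D_{\mathrm{KL}}(Q \parallel P) - \sum_{\mathbf{Z}} Q(\mathbf{Z}) \log \frac{Q(\mathbf{Z})}{P(\mathbf{Z}, \mathbf{X})} = D_{\mathrm{KL}}(Q \parallel P) + \mathcal{L}(Q).</math>

As the ''log evidence'' <math>\log P(\mathbf{X})</math> is fixed with respect to <math>Q</math>, maximizing the final term <math>\mathcal{L}(Q)</math> minimizes the KL divergence of <math>P</math> from <math>Q</math>. By appropriate choice of <math>Q</math>, <math>\mathcal{L}(Q)</math> becomes tractable to compute and to maximize. Hence we have both an analytical approximation <math>Q(\mathbf{Z})</math> for the posterior <math>P(\mathbf{Z} \mid \mathbf{X})</math>, and a lower bound <math>\mathcal{L}(Q)</math> for the log evidence <math>\log P(\mathbf{X})</math>; the bound holds because the KL-divergence is non-negative. The lower bound <math>\mathcal{L}(Q)</math> is known as the (negative) ''variational free energy'' because it can also be expressed as an "energy" <math>\operatorname{E}_Q[\log P(\mathbf{Z}, \mathbf{X})]</math> plus the entropy of <math>Q</math>.

Source: Wikipedia, the free encyclopedia, article "Variational Bayesian methods".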
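The decomposition of the log evidence into a KL-divergence plus a lower bound, as derived above, can be verified on a toy discrete model. The sketch below is illustrative only: the joint probabilities and the candidate distribution `q` are hypothetical values, and the function names `lower_bound` and `energy_plus_entropy` are assumptions of this sketch rather than anything defined in the article.

```python
import math

# Hypothetical joint probabilities P(Z = z, X = x_obs) for three latent states,
# evaluated at a single fixed observation x_obs. They need not sum to 1.
joint = [0.12, 0.06, 0.02]

evidence = sum(joint)                      # P(X) = sum_z P(z, X)
posterior = [j / evidence for j in joint]  # P(Z | X) = P(Z, X) / P(X)


def kl_divergence(q, p):
    """sum_z q(z) * log(q(z) / p(z)); zero-probability terms of q are skipped."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p) if qz > 0.0)


def lower_bound(q, joint):
    """L(Q) = -sum_z Q(z) * log(Q(z) / P(z, X))."""
    return -sum(qz * math.log(qz / jz) for qz, jz in zip(q, joint) if qz > 0.0)


def energy_plus_entropy(q, joint):
    """The same bound written as E_Q[log P(Z, X)] plus the entropy of Q."""
    energy = sum(qz * math.log(jz) for qz, jz in zip(q, joint) if qz > 0.0)
    entropy = -sum(qz * math.log(qz) for qz in q if qz > 0.0)
    return energy + entropy


q = [0.5, 0.3, 0.2]  # hypothetical candidate approximation Q(Z)

log_evidence = math.log(evidence)
kl = kl_divergence(q, posterior)
bound = lower_bound(q, joint)

print(f"log P(X)            = {log_evidence:.6f}")
print(f"D_KL(Q || P) + L(Q) = {kl + bound:.6f}")
print(f"L(Q)                = {bound:.6f}")
print(f"energy + entropy    = {energy_plus_entropy(q, joint):.6f}")
```

In this toy setting the posterior can be computed exactly, so the bound can be compared with the true log evidence; choosing `q` equal to `posterior` makes the KL term vanish and the bound tight. In realistic models the posterior is intractable, which is what motivates maximizing the bound instead.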